The dawning of a new decade is an appropriate
time to reflect on the tremendous progress that has been 
made in human genomic research. In 2010, with whole- 
genome sequencing becoming increasingly affordable, 
the promise of large-scale human genomic research stud- 
ies involving hundreds, thousands, and even hundreds of 
thousands of individuals is rapidly becoming a reality. 
The next generation of human genomic research will 
occur on a scale that would have been nearly unfath- 
omable at the start of the last decade, when the publica- 
tion of the Human Genome Project's first draft results 
was still pending. 



The cost of a diploid human genome sequence has dropped from about $70M to $2000 since 2007, even as the standards
for redundancy have increased from 7x to 40x in order to improve call rates. Coupled with the low return on investment
for common single-nucleotide polymorphisms, this has caused a significant rise in interest in correlating genome sequences
with comprehensive environmental and trait data (GET). The costs of electronic health records, imaging, and microbial,
immunological, and behavioral data are also dropping quickly. Sharing such integrated GET datasets and their
interpretations with a diversity of researchers and research subjects highlights the need for informed-consent models capable of
addressing novel privacy and other issues, as well as for flexible data-sharing resources that make materials and data avail- 
able with minimum restrictions on use. This article examines the Personal Genome Project's effort to develop a GET data- 
base as a public genomics resource broadly accessible to both researchers and research participants, while pursuing the 
highest standards in research ethics.

When the Human Genome Project published its draft
results on June 26, 2000, it published a compound human 
genome sequence containing genetic information from sev- 
eral volunteers. Seventy percent of the final sequence was 
obtained from one anonymous individual, while the remain- 
ing 30% came from a number of different individuals. From 
the first amalgamated human genome sequence — which 
was refined in 2003 and continues to be updated and refined 
to this day — private and public research efforts have gone 
on to sequence numerous individual human genomes with 
increasing speed and detail and decreasing time and cost. 
The acceleration of whole-genome sequencing in the 
research context necessitates new perspectives and models 
that enable scientists and society to learn as much as pos- 
sible from this rapidly expanding dataset while still respect- 
ing important ethical, legal, and social norms. 
The Personal Genome Project (PGP), 1 an ambitious 
research study directed by faculty members in the 
Department of Genetics at Harvard Medical School, aims 
to recruit as many as 100 000 informed participants to con- 
tribute genomic sequence data, tissues, and extensive envi- 
ronmental, trait, and other information to a publicly acces- 
sible and identifiable research database. 
In this review we describe the Personal Genome Project 
itself, focusing on its unique structural features and the 
rationale behind the project's design. We also elucidate the 
changing scientific and social landscape that makes the 
PGP's model of open consent and public data access 
increasingly important to the furtherance of human 
genomic research. 



The PGP's mission 

In contrast to research studies that focus on small sub- 
sets of traits within narrowly defined human populations 
exhibiting single diseases, the PGP was conceived with 
an expansive mission. From the outset, the mission of the 
project (Table I) has been to develop a broad-based, lon- 
gitudinal, and participatory research study that will facil- 
itate a comprehensive understanding of the project's 
participants at the genomic level and beyond. 
The PGP is constructed with the recognition that our 
desire to truly understand the genesis of most complex 
human traits — from dread diseases to the talents and 
quirks that make us each uniquely human — could only 
be satisfied by examining genomic information in con- 
text and by surrounding it with the richest possible data 
from the widest possible array of supplemental sources. 
By supplementing genomic sequence data with the col- 
lection and analysis of tissues and extensive environ- 
mental and trait data, and by making these data publicly 
accessible to researchers worldwide, the PGP aims to 
improve understanding of the ways in which genomes 
plus environments ultimately equal traits (Figure 1). 
The PGP is more than just a research repository. In addi- 
tion to its publicly accessible research database, the PGP, 
which is supported by the nonprofit PersonalGenomes.org, 
also works to disseminate genomic technology and knowl- 
edge at a global level, thereby producing tangible and 
widely available improvements in the understanding and 
management of human health and disease. The PGP also
finds itself at the forefront of discourse surrounding the
ethical, legal, and social issues (ELSI) associated with
large-scale whole-genome sequencing, particularly in the
areas of privacy, informed consent, and data accessibility.
The PGP is, and is intended to be, a research project that
is constantly in progress, exploring the boundaries of
human genomic research in a way that produces maximal
advances in scientific understanding and public
understanding and well-being, while striving to reach
beyond what is minimally required to satisfy its ethical,
legal, and social obligations to its participants. In the
sections that follow we report on unique aspects of the PGP
relating to technology development, integrative
genomics, and human subject research protocols, and
describe the development and current state of the
PGP.

The Personal Genome Project's Mission Statement

The mission of the Personal Genome Project is to encourage the development of personal genomics technology and practices that:

• are effective, informative, and responsible

• yield identifiable and improvable benefits at manageable levels of risk

• are broadly available for the good of the general public

To achieve this mission we will build a framework for prototyping and evaluating personal genomics technology and practices at increasing
scales. In support of this goal, we will:

• develop a broad vision for how personal genomes may be used to improve the understanding and management of human health and
disease

• provide educational and informational resources for improving general understanding of personal genomics and its potential

• recruit individuals interested in obtaining and openly sharing their genome sequences, related health and physical information, and
reporting their experiences as a participant of the project on an ongoing basis

• develop technologies to improve the accessibility of personal genome sequencing

• foster dialog with research communities, industries, and public and governmental bodies with interests in personal genomics, and related
ethical, legal, and social issues (ELSI)

• develop tools for interpreting genomic information and correlating it with personal medical and biological information

Table I. PGP's Mission Statement, available at: http://www.personalgenomes.org/mission.html. 1

Key developments in human genome sequencing

The PGP derives its impetus and importance from his- 
toric breakthroughs in understanding and analysis of 
DNA. DNA comprises only a very small fraction of a 
cell (~3% of dry weight in E. coli), and its role as the molecule
primarily responsible for transmission of genetic
traits was not recognized until a series of discoveries 
beginning in the 1940s. The emergence in 1953 of a clear 
concept of DNA as a double-helical structure compris- 
ing a pair of complementary strings of four elementary 
bases (the nucleotides A, C, G, and T) crystallized inter- 
est in determining the DNA sequences of genes and the 
sequence differences responsible for disease, and set the
stage for over four decades of development of ever more
efficient and comprehensive sequencing methods. Table 
II describes this history by a set of milestones that take 
one from the early beginnings of DNA sequencing up 
through delivery of draft human genome sequences in 
2001 to 2003. In the 38 years between 1965, when Robert 
Holley and colleagues at Cornell and the US 
Department of Agriculture sequenced a 77 nt RNA 
gene after 4 years of effort, and 2003, when the public 
Human Genome Project (HGP) declared that it had met 
its goals regarding delivery of a ~3Gbp human genome 
sequence, the size of DNA sequence that could be 
accommodated by sequencing technology improved ~30
million-fold. 

Post-HGP sequencing — towards 
whole diploid genomes 

Notably, the HGP had delivered only a single human
genome sequence that was a composite built from a small
number of deidentified individuals, while the competing
nonpublic human genome project merged in data from an
identified individual (Craig Venter); both were haploid
estimates. As recognized from the beginning of the HGP,
many additional resources would be needed to understand
the functions of the genes laid out in these "reference"
human genomes, and to identify the sequence differences
between individuals that contribute to individual
traits, health, and disease. Indeed, as the HGP ended,
projects were already under way to identify large numbers of
genetic differences from the HGP-derived reference
genome in different human populations that could
subsequently be analyzed using low-cost array methods in
large numbers of individuals, a strategy that has since
given rise to more than 480 published genome-wide
association studies. 16,17 At the same time, however, interest was
rising in a second approach: to significantly improve
DNA sequencing technology to a point where an
individual's entire genome could be sequenced at very low
cost. Two kinds of arguments were
advanced supporting this approach, focusing on functional
utility and economics, respectively.

[Figure 1 diagram labels: Personal genome (6 Gbp, 3M+ alleles); Personal stem cells and epigenome; Envirome; TRAITS (phenome).]

Figure 1. Genome + Environment = Traits (GET) equation. Envirome: the totality of environmental influences; VDJ-ome: the DNA sequences of the entire
repertoire of an individual's immunoglobulin and T-cell receptors, which reflect a lifetime of antigenic exposures; Microbiome: the billions of
commensal, symbiotic, and pathogenic micro-organisms that share our body space; Epigenome: the totality of programmed biochemical and
structural modifications to genomic DNA that regulate organism or phenotype development (see overview in Table III).

The gist of the functional arguments was that sequenc- 
ing of individuals is intrinsically more informative and 
flexible than array-based interrogation of known sites of 
variation and that, variation aside, any improvements in 
sequencing cost and capability could be quickly applied 
to numerous general aspects of biology that are critical 
to understanding gene function, traits, and health and 
disease. 18,19 The relative advantages of sequencing have
long been recognized. Unlike array analyses, sequenc- 
ing: (i) does not require variations to be preidentified; 
(ii) can more readily accommodate more complex vari- 
ations than single nucleotide changes and very short 
inserts or deletions; and (iii) need not focus on variations 
that are common in large populations vs rare or unique 
variations. In consequence, as sequencing technology has 
improved, it has increasingly been integrated into association
studies of variation. 20,23

However, these advantages of sequencing were counterbalanced
by its high cost, a situation well illustrated
by the US $3 billion cost of the HGP itself. It is here that
economic arguments were advanced suggesting that dramatic
improvements in sequencing were feasible that
might ultimately enable an individual's genome to be
sequenced for 1000 to 10 000 USD. 18 On an empirical
level, sequencing technology has appeared to exhibit a
historical trend of exponentially decreasing costs with
time as measured by sequenced base pairs per dollar at
a given error rate, a situation frequently compared with
"Moore's Law" in computing, 24 which noted that computing
power measured by the integrated circuit transistor
density doubled roughly every 2 years at constant
cost (Figure 2). 18,25 To get genome sequencing costs down
to $1000 would require cost and throughput improvements
of an additional 4 to 5 orders of magnitude, so the
question of economic feasibility ultimately turned on
whether new methods could enable this very large
improvement.
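
The size of the required improvement can be checked with a few lines of arithmetic. The sketch below is illustrative only: it takes the abstract's figure of about $70M per genome as an assumed starting cost (the text does not say which starting cost underlies the "4 to 5 orders of magnitude" estimate), and the function names are hypothetical.

```python
import math


def fold_change(start_cost_usd, target_cost_usd):
    """How many times cheaper the target is than the starting cost."""
    return start_cost_usd / target_cost_usd


def orders_of_magnitude(start_cost_usd, target_cost_usd):
    """Base-10 logarithm of the required cost reduction."""
    return math.log10(fold_change(start_cost_usd, target_cost_usd))


def years_at_fixed_doubling(start_cost_usd, target_cost_usd, doubling_years=2.0):
    """Years needed if cost efficiency only doubled every `doubling_years`
    (the Moore's Law pace quoted in the text)."""
    doublings = math.log2(fold_change(start_cost_usd, target_cost_usd))
    return doublings * doubling_years


if __name__ == "__main__":
    start, target = 70e6, 1000.0  # assumed starting cost; $1000 target from the text
    print(round(orders_of_magnitude(start, target), 2))
    print(round(years_at_fixed_doubling(start, target), 1))
```

Under these assumptions the gap is a little under 5 orders of magnitude, which at a doubling every 2 years would take roughly three decades to close. This is why the feasibility question turned on new methods rather than on incremental refinement.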

Date | Event | Size of sequence (bp) | Reference
1957 | First sequence mutation identified responsible for disease | 1 amino acid (sickle cell vs normal hemoglobin) | Ingram 1957 (ref 2)
1965 | First sequence of a single complete gene | 77 bases | Holley, Apgar et al 1965 (ref 3)
1976-1977 | Sequencing of first viral genomes | 3562 bases (MS2 RNA phage); 5375 bases (φX174 DNA phage) | Fiers, Contreras et al 1976 (ref 4); Sanger, Air et al 1977 (ref 5)
1975-1977 | Maxam/Gilbert and Sanger DNA sequencing methods | | Sanger and Coulson 1975 (ref 6); Maxam and Gilbert 1977 (ref 7); Sanger, Nicklen et al 1977 (ref 8)
1994 | First commercial bacterial genome sequence | 1.7Mbp (Helicobacter pylori) | Nature Genetics, May 1996 (ref 9)
1995 | First published bacterial genome sequence | 1.83Mbp (Haemophilus influenzae) | Fleischmann, Adams et al 1995 (ref 10)
1998-2000 | Genome sequences of first animals | 100Mbp (Caenorhabditis elegans); 120Mbp (Drosophila melanogaster) | C. elegans Sequencing Consortium 1998 (ref 11); Adams, Celniker et al 2000 (ref 12)
2001 | Two draft sequences of human genome | ~3Gbp | Lander, Linton et al 2001 (ref 13); Venter, Adams et al 2001 (ref 14)
2003 | Completion of public Human Genome Project | | Collins, Morgan et al 2003 (ref 15)

Table II. Development of DNA sequencing.

Here, the HGP again gave grounds for optimism, for
even though the HGP itself only achieved 100-fold
improvements, it achieved this largely by refining,
miniaturizing, and robotically scaling up, but not
fundamentally changing, a Sanger sequencing method initially
developed over 20 years earlier (Table II). If such methods
were capable of 100-fold improvement, considerably
greater improvements might be expected from more radically
changing sequencing chemistry, signal generation
and detection, and instrumentation in ways that could
integrate some of the vast advances in chemistry and
enzymology, optics and electronics, materials science,
microfabrication, and process control that had accrued
over the preceding 20 years and been put to good use in
many other fields. The HGP also directly provided an
important resource for realizing this strategy: the reference
human genome sequence itself, as this could serve
as a template against which reads obtained by new technologies
could be located, allowing new human genomes
to be assembled at least initially by "resequencing" vs de
novo assembly. This reduces the burden on new sequencing
methods by allowing them to generate useful data
with shorter reads and higher base call error rates than
would generally be needed for de novo assembly,
although de novo assembly of genomes using new
sequencing technology remains an important goal.
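
The resequencing idea can be illustrated with a toy example. The sketch below is not any production aligner: it simply slides a short read along a reference string, keeps the placement with the fewest mismatches, and reports where the read differs from the reference. The sequences, the function names, and the mismatch threshold are all hypothetical.

```python
def best_placement(reference, read, max_mismatches=2):
    """Return (start, mismatches) for the lowest-mismatch placement of
    `read` on `reference`, or None if no placement is within the limit."""
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best


def differences(reference, read, start):
    """List (position, reference_base, read_base) where the placed read differs."""
    return [
        (start + i, reference[start + i], base)
        for i, base in enumerate(read)
        if reference[start + i] != base
    ]


if __name__ == "__main__":
    reference = "ACGTTAGCCGATAGGCTA"  # made-up reference fragment
    read = "GCCGTTAG"                 # made-up read with one substituted base
    start, _ = best_placement(reference, read)
    print(start, differences(reference, read, start))
```

Because the reference tells the procedure where a read belongs, a short read with an occasional miscalled or variant base can still be placed, which is the sense in which a reference lowers the bar for new sequencing methods. De novo assembly has no such template and must infer placement from overlaps among the reads themselves.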

Next-generation sequencing 

Researchers were quick to work out sequencing
approaches along the lines indicated in these arguments,
and commercial products soon emerged, giving rise to
next-generation sequencing (NGS). Granting agencies
soon promised funding support, and a ~10M USD
competition was announced for rapid, accurate genomic
sequencing, generating increased coalescence around
target goals for dramatic improvements to sequencing
technology. 26,27,28 Detailed reviews and comparisons of
NGS approaches have been published. 18,29,30
Among the earliest NGS methods were polony sequencing
(the Polonator) and the 454 Life Sciences platform. 31,32,33 Both methods
amplify DNA templates onto microbeads that are
packed onto two-dimensional arrays for sequencing,
thereby achieving enormous economies of scale compared
with Sanger sequencing, and each achieved ~25-fold
better cost per bp compared with the HGP (Figure 2).
However, each uses different sequencing chemistry and
arraying technology, giving rise to many technical trade-offs.
Together they proved the general point that great
improvements in sequencing efficiency were indeed
within reach, but also that the precise character and
degree of improvement would depend closely on the
novel technologies employed and the ingenuity with
which they could be integrated. A second wave of development
introduced methods by Illumina and ABI that, by
very different means, have improved utility and cost
(Figure 2). 34,35 Use of these systems is consequently becoming
widespread for both large-scale and "deep" sequencing
applications, and both are under continuous development.



Two complete cancer genomes were recently sequenced,
one with each platform. 36,37 Further rounds of innovation
have yielded a diverse set of newer NGS methods. For
instance, a number of "single-molecule" sequencing methods
are now available or in development. These methods
avoid the need to make thousands to millions of copies of
DNA template molecules on microbeads or surfaces to
ensure that sequencing operations generate sufficient signal
to read individual bases accurately, and instead use
highly sensitive optics to detect bases at the single-molecule
level; this allows even denser packing of DNA templates
and further efficiencies in sequencing chemistry.
While Helicos Biosciences has commercialized a single-molecule
system that simply arrays single template molecules
on a surface and uses sequencing cycles similar to
the methods above, Pacific Biosciences is developing a
system in which enzymes and templates are tethered to
the bottom of nanofabricated wells and which monitors
the signals generated by sequencing chemistry in real
time rather than in artificial cycles. 38,39 Here, the nanofabricated wells
enable substantially increased accuracy of single-molecule
base incorporation events. Finally, on another track, the
company Complete Genomics, Inc has developed a
method whereby very compact self-assembling amplicons
of template DNAs called "nanoballs" are flowed onto a
nanofabricated grid of ~300 nm spots at 700 to 1300 nm
center-to-center distances. Three complete human
genomes were sequenced with this method (as of January
2010) with an average consumable cost of $4400 and as
low as $1500 for 40X coverage. 40
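
To give a sense of what "40X coverage" implies, the sketch below works through the arithmetic under two assumptions that are not stated in the text: coverage is counted against a ~3 Gbp haploid reference, and read depth at any one position is approximately Poisson distributed (the classical idealization in which reads land uniformly at random). The function names are hypothetical.

```python
import math


def bases_required(coverage, genome_bp=3e9):
    """Total sequenced bases needed for a given average coverage."""
    return coverage * genome_bp


def cost_per_megabase(total_cost_usd, coverage, genome_bp=3e9):
    """Consumable cost per million sequenced bases."""
    return total_cost_usd / (bases_required(coverage, genome_bp) / 1e6)


def poisson_at_least(k, mean_depth):
    """P(depth >= k) at one position when depth ~ Poisson(mean_depth)."""
    below = sum(
        math.exp(-mean_depth) * mean_depth ** i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below


if __name__ == "__main__":
    print(bases_required(40))                 # bases read for one 40X genome
    print(cost_per_megabase(4400, 40))        # at the average cost quoted above
    print(poisson_at_least(10, 7), poisson_at_least(10, 40))
```

Under these assumptions a 40X genome corresponds to 1.2 x 10^11 sequenced bases, or a few cents per megabase at the quoted consumable costs. The Poisson comparison shows why higher redundancy helps call rates: the chance that a given position is read at least ten times is much higher at 40X average depth than at 7X.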
Figure 2. Exponential trend of sequencing costs in base pairs per USD
(bp/$), a trend often compared with Moore's Law (see text). See
ref 25 for details.

Towards affordable personal genomes

These developments suggest that technology capable of
meeting the cost target of $1000 or less for a diploid
human genome sequence is within reach. Indeed, NGS
developers have now carried out in-depth resequencing of
individual human genomes several times to
demonstrate that their methods have come of age. There
are now published full genome sequences for at least
seven individuals, 40 with some having been sequenced by
more than one method. There are also tens, and perhaps
hundreds, of additional unpublished or partly published
genomes (see, eg, refs 36,37), while the lower-coverage 1000
Genomes Project 20,21 continues. Clearly, the age of personal
genomics is now close at hand.

The PGP 

As described in the first section, one of the PGP's central 
aims is to develop a publicly available, fully consented 
database containing comprehensive human genome and 
phenome data for its research participants. Such inte- 
grated datasets are fundamental drivers of progress in 
functional genomics and enable systems biology-based 
insights into the mechanisms of human health and dis- 
ease. 41 PGP studies will look beyond inherited genomes 
to include somatic and epigenetic variation data, as well 
as relevant microbiome, transcriptome, immunity-reflect- 
ing "VDJ-ome" and phenome data to develop compre- 
hensive profiles. By developing high-resolution data pro- 
files for each participant, and multiplying that by a large 
(up to 100 000) participant population, the PGP will also 
generate valuable data describing the kinds and distrib- 
utions of variation that exist in populations. Although an 
improved understanding of human health and disease is 
a central aim of the PGP, its focus is considerably broader 
and will enable research into the social and behavioral 
sciences using personal genomic data. Finally, the PGP's 
flexible study protocol and public and distributed 
approach to research enables it to keep pace with 
sequencing and other technological advances while 
simultaneously driving these developments. 

Integrated personal genomes: inherited, somatic, environmental genomics

If the PGP is to fulfill its mission to address the multidimensional
complexity of human biology, it must encompass
multiple interacting "-omes." For example, a person's
diet will have a profound influence upon her or his
somatic gene expression as well as the genomic and pro- 
teomic activity of the person's microbiome. It will also 
affect the metabolome. Similarly, an individual's envi- 
ronmental exposures to pollutants will have a direct 
bearing on her or his immunological response and there- 
fore, on the VDJ-ome. Germline alleles will affect how 
one metabolizes drugs, which will have myriad effects on 
an individual's physiological and behavioral phenotypes. 

Genomes (vs exomes) 

In its early phase, given the then-current cost of genomic 
sequencing, the PGP planned to focus on exomes rather 
than whole genomes as a way to affordably expand the 
project to large numbers of participants. Despite repre- 
senting only 1% to 2% of the 6 billion base pairs in a 
human genome, the exome contains all protein-coding 
exons and therefore provides access to the majority of 
known functional variants. 48,49,50 However, continued
improvements in genomic sequencing have produced 
price declines that have rendered whole-genome 
sequencing significantly cheaper per base pair than 
exome sequencing. The PGP, as a result, has determined 
that whole-genome sequencing is cost-justified given the 
relatively high price of exomes and the additional infor- 
mation supplied by whole-genome sequences of PGP 
participants. 51 See also Table III for the various "omes." 
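
The per-base comparison can be made concrete with a short calculation. The sketch below is illustrative only: it borrows the $4400 average consumable cost quoted earlier for one sequencing platform as an assumed whole-genome price (not a PGP price), and asks what an exome covering 1% to 2% of the 6 billion base pairs would have to cost to match it per base. The function names are hypothetical.

```python
def exome_size_bp(genome_bp=6e9, fraction=0.01):
    """Number of base pairs in an exome that is `fraction` of the genome."""
    return genome_bp * fraction


def cost_per_base(price_usd, size_bp):
    """Price divided by the number of base pairs delivered."""
    return price_usd / size_bp


def break_even_exome_price(genome_price_usd, fraction):
    """Exome price at which cost per base equals that of the whole genome."""
    return genome_price_usd * fraction


if __name__ == "__main__":
    genome_price = 4400.0  # assumed, taken from the consumable cost quoted earlier
    for fraction in (0.01, 0.02):
        print(fraction, exome_size_bp(fraction=fraction),
              break_even_exome_price(genome_price, fraction))
```

Because an exome is only 60 to 120 million base pairs, it would have to be priced at 1% to 2% of a whole genome to cost the same per base. Any exome price above that threshold makes the whole genome cheaper per base pair, though not necessarily cheaper per participant.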

Phenomes 

Detailed phenotype data is required to categorize and, 
ultimately, understand the phenotypes that the PGP 
seeks to explore. However, the vastness of the human 
phenome, defined as the physical totality of human traits 
at all levels, from the molecular to the behavioral, will 
require new strategies that permit high-throughput trait 
collection while yielding accurate and standardized phe- 
notypic data. With regard to the cellular and molecular 
phenotypes, the PGP collects participant tissue samples 
and develops cell lines that are then deposited in and made
publicly accessible through established biobanks. 52,53
As the PGP expands it is exploring Web-based, high- 
throughput behavioral phenotype data-collection mod- 
els pioneered by leading public and private researchers. 
While the reliability and validity of self-reported traits
are a concern, particularly for phenome research conducted
online, 54,55 Web-based assessments provide distinct
opportunities for "dynamic phenotyping" based on
a particular individual's prior genotype-phenotype
associations. 56 The multimodal capabilities of Web-based trait
collection instruments, combined with their low cost of 
implementation at large scales, seem likely to accelerate 
the ability of studies like the PGP to effectively explore 
new corners of the human phenome. 
The PGP is also taking advantage of recent advance- 
ments in health information technologies to assist par- 
ticipants and researchers alike in structuring and access- 
ing the massive amounts of personalized data generated 
by the project. The emergence of online Personally 
Controlled Health Record (PCHR) platforms and other 
novel tools enables individuals to collect and manage 
their own health data (including health history, medication,
allergy, immunization, biometric, and other data
types 57,58,59) and can be developed for integrated data
entry, access and dissemination by both the individual 
and third-party researchers or data providers, including 
health care providers. 

Enviromes 

The picture of genome and phenome is incomplete with- 
out the envirome. The envirome can be described as the 
totality of equivalent environmental influences con- 
tributing to all disorders and organisms. 60 The mode of 
response of an organism to the environment that is 
reflected in its phenotype is constrained by its unique set 
of genetic variations and the environmental influences on 
gene expression. Therefore, a comprehensive approach is 
required to describe the envirome systematically in conjunction
with genome and phenome information. The relevant
envirome data is too large and complex to be
reported, managed, or analyzed manually. The creation of 
phenome-genome and genome-envirome networks has 
been suggested in order to relate phenome and envirome 
information to potential disease-associated genes. 61 

Microbiomes 

Even though microbial cells are estimated to outnumber 
human cells in a single individual by a factor of ten, we 
know very little about the microbes that live in and on us, 
including what mixture of bacteria, viruses, and other 
micro-organisms constitute a "normal" human micro- 
biome and how those organisms impact different biolog- 
ical states. 62 Major efforts such as the Human Microbiome 
Project are under way to characterize the microbiota at 
different body sites in humans and to assess how variation 
in microbial communities is associated with states of 
health and disease. 63 The PGP takes advantage of the 
unique availability of comprehensive participant profiles 
and uses them to explore interactions between host 
genetic and phenotypic variability alongside the genomic 
variation in the microbes that colonize them. 64 

The VDJ-ome 

The Church Lab at Harvard Medical School is developing
techniques for characterizing the repertoires of B- and T-cell
receptors in individual humans from blood
samples and correlating them across time with personal exposure
histories, with an ultimate goal of characterizing
individuals' repertoires of linked VDJ and VJ sequences.
These techniques will be directly applicable to PGP participants
and their self-reported data, and will yield a
database of unprecedented depth describing the diversity
and time development of human immune responses
of large numbers of individuals in their life contexts.

Personal genome: Entire diploid human genome of a single individual representing 6 billion base pairs.
Exome: All exons, representing 1% to 2% of the entire human genome.
Phenome: Set of all traits in an organism, at all levels, or one of its subsystems, including morphology, physiology, and behavior. 41,43
Envirome: The totality of equivalent environmental influences contributing to all disorders and organisms. 44
Microbiome (human): The ecological community of commensal, symbiotic, and pathogenic microorganisms that share our body space. 45
VDJ-ome: The repertoire of rearranged V, D, and J genome segments present in an individual's B and T immune cells at any given time (see
Table IV).
Transcriptome: The set of all RNA molecules, including mRNA, rRNA, tRNA, and noncoding RNA produced in one or a population of cells. 46
Epigenome: The totality of programmed biochemical and structural modifications to genomic DNA that regulate organism or phenotype
development.
Metabolome: Total set of metabolites generated by an organism, or subsystem.
Proteome: The entire set of proteins expressed by a genome, cell, tissue or organism at a given time under defined conditions. There are
more proteins than genes. 47

Table III. The "omes."


The adaptive immune system 

The adaptive immune system enables individuals to respond to 
their unique exposure histories to pathogens and environmental 
antigens, and possibly to cancerous mutations in their own cells, by 
generating and modulating expression of >10^12 unique antibodies
from B cells and T cell receptors. 65 Antibody diversity derives from
programmed stochastic rearrangements in maturing B cells of ~40
V, 23 D, and ~5 J functional genomic segments into VDJ heavy
chains, and ~35 V and ~5 J segments into VJ light chains (κ or λ) in
B cells, that are further randomized by somatic hypermutation; a 
similar process occurs in T cells. 66 NGS methods are now allowing 
researchers to identify and analyze expressed VDJ sequences in 
depth. 67 

Table IV. The adaptive immune system and the VDJ-ome. 
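
The segment counts quoted in Table IV allow a rough count of how much diversity comes from segment choice alone. The sketch below multiplies the approximate numbers of V, D, and J segments for heavy and light chains; it is a back-of-the-envelope illustration, it treats the approximate counts as exact, and the variable names are hypothetical.

```python
from math import prod

# Approximate functional segment counts as quoted in Table IV.
HEAVY_SEGMENTS = {"V": 40, "D": 23, "J": 5}
LIGHT_SEGMENTS = {"V": 35, "J": 5}


def combinations(segment_counts):
    """Number of distinct ways to pick one segment of each type."""
    return prod(segment_counts.values())


heavy = combinations(HEAVY_SEGMENTS)   # VDJ heavy-chain combinations
light = combinations(LIGHT_SEGMENTS)   # VJ light-chain combinations
paired = heavy * light                 # heavy/light pairings
shortfall = 1e12 / paired              # factor still needed to reach 10^12

if __name__ == "__main__":
    print(heavy, light, paired, round(shortfall))
```

Segment choice alone gives on the order of 10^6 heavy and light chain pairings, about a million-fold short of the >10^12 figure in the table. The remaining diversity must therefore come from mechanisms beyond segment choice, such as the somatic hypermutation the table mentions.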
Tissue reprogramming 

The PGP also applies advances in tissue reprogramming 
techniques to tissue samples collected from PGP partic- 
ipants. Cells from collected somatic tissues are repro- 
grammed into induced pluripotent stem (iPS) cells 68 and 
made to differentiate into the cell types that are targeted 
for functional analysis. These methods enable experi- 
mental access to diverse tissue types that would other- 
wise be unobtainable from human subjects but are rou- 
tinely analyzed in model organisms, and thus, PGP 
participants can effectively serve as human model organ- 
isms. By examining multiple cell types from a single indi- 
vidual, differences in physiological states within and 
between tissues can be compared within a single PGP 
participant and/or across the entire PGP cohort. This 
approach also permits researchers to elucidate connec- 
tions between genetic variation and variation in other 
molecular traits, such as gene expression or epigenetic 
modifications. 69 Stored fibroblast cell lines provide 
researchers with access to renewable supplies of differ- 
ent tissue types from PGP participants. 

The PGP: from personal to public genomes 

The potential benefits arising from large-scale and integrated
human genomic datasets are immense. 70 The utility
of such research, however, depends upon the responsible
development and widespread availability of such
comprehensive datasets, which in turn depends on 
describing and addressing the various ethical, legal and 
social challenges. Those challenges include a standard set 
that are inherent to any research involving human sub- 
jects, as well as certain challenges that are unique to 
"public genomics" 71 research involving publicly available, 
identifiable whole-genome sequence data, such as the 
model pioneered by the PGP. We use the term "public 
genomics" to denote research studies that possess the 
following three critical attributes. 

Integrated data 

The various data types, including genomic and phenomic 
or trait data, are accessible in a linked format, such as a 
PCHR or other integrated data structure. Through this 
explicit linkage of data it is possible to ascertain the 
complete list of available traits and genetic variants for 
any given participant. Integration also facilitates partic- 
ipant-researcher interactions, longitudinal study and 
recontact and, crucially, simultaneous investigation of the 
full range of complex trait associations. Although par- 
ticipants need not be explicitly identified, integrated data 
sets that include both genomic and phenomic data will 
be identifiable in most cases. For this reason, participants 
must be made explicitly aware of the probability that 
they will be identified with their publicly available data, 
rendering promises of perfect privacy, anonymity, or con- 
fidentiality impermissible within the public genomics 
model. However, the promise of privacy need not give 
way to a promise of publicity. 
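
A minimal sketch of what "linked" data can mean in practice, assuming a simple in-memory store. The class and method names (`ParticipantRecord`, `IntegratedDataset`, `profile`, `participants_with`) are hypothetical illustrations and do not describe the PGP's actual data structures or any PCHR format. The sketch shows only the property named above: because genomic and trait data share one participant key, the complete list of traits and variants for any participant can be read off directly, and trait-variant combinations can be queried across the cohort.

```python
from dataclasses import dataclass, field


@dataclass
class ParticipantRecord:
    """Hypothetical linked record holding both data types for one participant."""
    participant_id: str
    variants: set = field(default_factory=set)    # variant identifiers
    traits: dict = field(default_factory=dict)    # trait name -> reported value


class IntegratedDataset:
    """Toy in-memory store keyed by participant identifier (an assumption)."""

    def __init__(self):
        self._records = {}

    def _record(self, participant_id):
        # Create the record on first use so both data types share one key.
        return self._records.setdefault(
            participant_id, ParticipantRecord(participant_id)
        )

    def add_variant(self, participant_id, variant):
        self._record(participant_id).variants.add(variant)

    def add_trait(self, participant_id, name, value):
        self._record(participant_id).traits[name] = value

    def profile(self, participant_id):
        """Return (sorted variants, traits) for one participant."""
        record = self._records[participant_id]
        return sorted(record.variants), dict(record.traits)

    def participants_with(self, variant, trait):
        """Participants who carry a variant and also reported a trait."""
        return sorted(
            pid
            for pid, rec in self._records.items()
            if variant in rec.variants and trait in rec.traits
        )
```

The same linkage that makes such queries trivial is what makes the records identifiable: anyone holding a profile holds the genome and the traits together.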

Open access 

Data sets and tissues are made publicly available with 
minimal or no access restrictions (including researcher 
qualifications and cost), and are generally transferable 
outside the original research study to be utilized by and 
combined with data from third parties. Well-developed 
data structures and intellectual property licenses are 
important components of this characteristic. Developing 
datasets that are not only publicly available but also eas- 
ily portable fosters the development of a genomic com- 
mons, allows data validation by third parties, and enables 
the use and application of data in novel contexts that 
may not be foreseeable at the time of collection, thereby 
facilitating hypothesis generation, encouraging serendipity
and broadening the genomic research community. 

Voluntary and informed participation 

Satisfaction of the first two criteria (publication of an
integrated dataset in an open-access format) necessitates that
a premium be placed on receiving truly voluntary and 
informed consent from participants in public genomics 
research projects. Given the yet-unknown outcomes and 
the potential personal, familial, and social risks associated 
with such research, enrollment is only acceptable under 
an informed consent protocol that is specially designed 
to meet the highest standards of human research subjects 
protection in view of these conditions. 

The study protocol 

The PGP aims to produce public genomics research, and
to develop and evaluate associated technologies and
research, on a large and expanding scale. In
October of 2008, the PGP published the first inte- 
grated set of DNA sequences, traits, and tissues col- 
lected from ten participants (the "PGP-10") enrolled 
in a pilot study initiated in 2005. Today, the PGP is 
incrementally expanding its cohort toward 100 000 
participants. More than 12 000 individuals had regis- 
tered to participate in the PGP as of February 2010. In 
the following section we highlight significant features 
of the PGP study protocol as it is implemented for the 
enrollment of the first 100 participants ("PGP-100") 
and summarized in Table V. 
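
The staged structure of the protocol summarized in Table V can be sketched as a small state machine. This is an illustrative reading of the table, not the PGP's enrollment software: the four stage names come from Table V, while the `Stage` enum, the `WITHDRAWN` state, and the `advance` and `withdraw` functions are assumptions introduced here. The sketch encodes two features of the protocol: stages are passed in a fixed order, and a participant can leave at any point.

```python
from enum import Enum


class Stage(Enum):
    ELIGIBILITY_SCREENING = 1
    PRE_ENROLLMENT = 2
    ENROLLMENT = 3
    ONGOING_PARTICIPATION = 4
    WITHDRAWN = 5  # assumed terminal state for anyone who leaves


# Forward order of the four stages named in Table V.
_ORDER = [
    Stage.ELIGIBILITY_SCREENING,
    Stage.PRE_ENROLLMENT,
    Stage.ENROLLMENT,
    Stage.ONGOING_PARTICIPATION,
]


def advance(stage):
    """Move to the next stage; the last stage and WITHDRAWN stay put."""
    if stage in (Stage.WITHDRAWN, Stage.ONGOING_PARTICIPATION):
        return stage
    return _ORDER[_ORDER.index(stage) + 1]


def withdraw(stage):
    """Leaving is possible from every stage, so this never fails."""
    return Stage.WITHDRAWN
```

Withdrawal here ends participation only; it does not model retraction of data that has already been published.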

Public genomes: adding to ELSI 

The practice of public genomics poses its own challenges, 
especially for the organization and governance of human 
subjects' research, forcing us to critically reassess current 
frameworks and practices. In order to pursue innovative 
research in a responsible manner, the PGP has developed 
a number of project-specific tools and resources 
relevant to ELSI. 

Eligibility screening 

• Review and sign "mini-consent" form. 
• Eligibility questionnaire about family circumstances and privacy preferences. 
• Entrance exam to ensure informed consent; includes potential risks of participating, project protocols, and basic genetics. 
• Review of full PGP consent form. 
• Submit information or delete account. 

Pre-enrollment 

• Consent to participate. 
• Collection of baseline trait data via questionnaire and a personal health record. Includes allergies, immunizations, medical history, medications, physical traits and measurements, diet, ethnicity/ancestry, lifestyle, and environmental exposures. 
• Participants asked to make a financial pledge (does not impact enrollment decisions). 
• Identity verification and provision of mailing address. 
• Submission of application for enrollment. Individuals selected to continue the enrollment process will receive an enrollment kit by mail, including saliva collection materials. 

Enrollment 

• Participants may be interviewed by one or more PGP staff to verify identity and consent, confirm familiarity with study protocols, and/or review trait questionnaire responses. Blood samples, saliva sample, and/or skin cells may be collected. 
• Tissue samples prepared for DNA sequencing and other biological analyses. 
• Participants opt-in to have their profiles made available on a publicly accessible Web site, or withdraw from the study. 
• Establishment, distribution and analysis of cell lines for research. 

Ongoing participation 

• Information collected for 25 years. Participants can leave the study at any time. 
• Data Safety Monitoring Board monitors the impacts of the PGP on enrolled participants. Quarterly emails inquire about adverse events. 
• Additional trait data and tissue samples may be requested periodically. 

Table V. Overview of PGP study protocol. 

Adapted from ref 52: Angrist M. Eyes wide open: the personal genome project, citizen science and veracity in informed consent. Pers Med. 2009;6:691-699. 
Copyright © Future Medicine 2009 

Open consent 

The "open consent" model developed by the PGP is 
designed to address the set of challenges associated with 
the creation of datasets where it may be possible to iden- 
tify individual participants with their genomic and other 
data. The open consent model assumes that, in such a 
context, conventional assurances of anonymity, privacy 
and confidentiality are impossible and should not serve 
as any part of the foundation for the informed consent 
protocol. 72,73 Due to the structure of public genomics 
projects such as the PGP, and their associated datasets, while 
privacy and confidentiality can be protected they cannot 
and should not be guaranteed to participants. This prac- 
tice ensures veracity, which we regard as a necessary — 
though not sufficient — prerequisite for the exertion of 
substantive autonomy. It is only through veracity that the 
criteria underlying truly informed consent can be satis- 
fied. 

Open consent is therefore based on complete openness 
and transparency with regard to all aspects of participa- 
tion, including the potential for reidentification and the 
reality that there may be other risks that are unidenti- 
fiable at the time of consent. Predicting all potential risks 
is by definition impossible and even a list of known pos- 
sible risks is unlikely ever to be comprehensive. 



Data sharing — and the risks of public genomes 

The PGP's informed consent process begins with an 
extensive pre-enrollment educational examination 
designed to ensure a potential participant's ability to 
understand the specific nature of the data collected and 
the risks presented by public genomics research. For indi- 
viduals who demonstrate the needed proficiency, the spe- 
cific informed consent agreement that follows includes a 
lengthy but "noncomprehensive list of hypothetical sce- 
narios that could pose risks" for participants and their 
families (Table VI). Participants are warned that "the com- 
plete set and magnitude of the risks that the public avail- 
ability of [your genomic data] poses to you and your rel- 
atives is not known at this time." It is crucial that 
participants understand that once identifying genetic and 
trait data and tissues are released into the public domain 
for the express intent of broad dissemination and use by 
third parties it will be, in all likelihood, impossible to effect 
a meaningful retraction at a later date. 

The PGP's informed consent agreements and broader 
study protocol are developed in continuous close 
interaction with the Harvard Medical School Committee on 
Human Studies. The project is also overseen by an 
independent Data Safety Monitoring Board. Removing 
potentially disingenuous promises of anonymity, privacy, and 
confidentiality, while seeking to comprehensively and 
openly describe both known and unknown risks of 
participation, helps to ensure that research participants are as 
informed as possible about the nature of public genomics 
research and, simultaneously, safeguards the 
trustworthiness of scientists and of scientific research in general. 

Potential risks of participation in the PGP as described in the consent form (abbreviated) 

• The risks of public disclosure of your genetic and trait information could affect your employment, insurance and financial well-being and social interactions for you and your family. 
• Anyone with sufficient knowledge and resources could take your DNA sequence data and/or posted trait information and use that data, with or without modification, to: (i) infer paternity or other features of your genealogy; (ii) claim statistical evidence that could affect your employment, insurance or ability to obtain financial services; (iii) claim relatedness to criminals or incriminate relatives; (iv) make synthetic DNA and plant it at a crime scene, or otherwise use it to falsely identify you; or (v) reveal the possibility of a disease or unknown propensity for a disease. 
• Whether or not it is lawful to do so, you could be subject to actual or attempted employment, insurance, financial, or other forms of discrimination or negative treatment on the basis of the public disclosure of your genetic and trait information by the PGP or by a third party. 
• The distribution of your cell lines could result in the creation and further distribution by a third party of additional cell lines, organs, or tissues containing your DNA for research, commercial, clinical, or other uses, including certain forms of assisted reproduction, some of which you may find objectionable or upsetting. 
• If you have previously made available or intend to make available genetic information in a confidential setting, for example in another research study or in a clinical trial, the data that you provide as part of the PGP may be used, on its own or in combination with your previously shared data, to identify you as a participant in otherwise confidential genetic research or trials. 

Table VI. Potential risks of participation. 

Return of research data to participants 

Research volunteers have been traditionally treated as 
"objects" of study who have no intrinsic rights to the 
data generated by their participation. 74 Today, we see 
that study participants are increasingly asking for access 
to their data 75 and that available information and com- 
munication technologies have turned the return of 
research results into a feasible option. While some 
researchers adhere to the traditional viewpoint that 
research subjects should not or cannot receive identifi- 
able research data, some have suggested legal and ethi- 
cal grounds for finding that researchers possess the 
obligation to inform their participants of certain results, 
particularly when they are clinically actionable. 76 
However, defining the scenarios in which research 
results should be reported — and how to report such 
results — remains a challenging issue. The medical, finan- 
cial, and psychosocial risks of disclosing variants of 
known and unknown clinical significance require that a 
careful distinction be made between those variants for 
which convincing clinical observational data exist and 
those for which disease association is less robust, a 
distinction that can influence both when and how to return 
results. Other concerns that have been voiced include 
the uncertainty surrounding regulations governing the 
return of genomics research results directly to partici- 
pants, the impact of false-positive and/or false-negative 
results, as well as the "incidentalome," 77 and in the con- 
text of commercial direct-to-consumer testing, the con- 
cern that obtaining results could lead to a "raiding of the 
medical commons." 78 

As new models of genomic research and commerce 
emerge, new mechanisms for communicating results to 
participants are also being explored. Many of these new 
models embrace a high level of involvement from their 
participants and, in return, may rely on some combina- 
tion of education, informed consent, and intermediation 
to return data in a responsible fashion. 79 

The public genomics model adopted by the PGP utilizes 
the first two approaches while forgoing the third, opting 
to return data directly to research participants without 
the required intervention of an intermediary. The 
advantages of direct data return and participant 
communication are blunted by the partial shifting of the 
interpretative burden from the clinician to the researcher. The PGP 
has approached this issue by focusing on data disclosure 
via the Preliminary Research Report (PRR), which con- 
tains a noncomprehensive list of genetic variants present 
in the participant's DNA sequence data currently 
thought to have a likelihood of clinical relevance among 
individuals possessing such variants. 
This preliminary identification of potentially significant 
variants is not intended to substitute in any way for pro- 
fessional medical advice, diagnosis or treatment. It lever- 
ages current knowledge by combining an evolving set of 
filtering algorithms and the use of existing variant data- 
bases — neither of which can be expected to have 100% 
accuracy in identifying truly pathogenic variants given 
the gaps in current scientific understanding. Participants 
are specifically instructed to confirm any potentially sig- 
nificant findings in consultation with their health care 
provider. It is possible that the increased rate of data 
return from public genomics research — as well as from 
commercial providers of personal genomic data — will 
help speed the creation of universal standards for clini- 
cal genomic interpretation that will help shift some of 
the interpretative burden back away from public 
genomics researchers. 
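
A minimal sketch of the general idea behind such a report, assuming a simple lookup against an annotated variant list. It is not the PGP's filtering algorithm or the GET-EvidenceBase tool: the `Annotation` class, the three evidence labels, the `EVIDENCE_RANK` ordering, and the `preliminary_report` function are all hypothetical, and the sketch assumes at most one annotation per variant. It illustrates why a report of this kind is noncomprehensive by construction: variants missing from the database, or annotated below the chosen evidence threshold, never appear.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical database entry for one variant."""
    variant: str
    condition: str
    evidence: str  # assumed labels: "well-established", "likely", "uncertain"


# Assumed ordering of evidence labels, strongest first.
EVIDENCE_RANK = {"well-established": 0, "likely": 1, "uncertain": 2}


def preliminary_report(participant_variants, database, min_evidence="likely"):
    """List annotated variants whose evidence meets the threshold.

    Variants absent from the database, or below the threshold, are
    omitted, so the result can miss truly pathogenic variants.
    """
    cutoff = EVIDENCE_RANK[min_evidence]
    by_variant = {a.variant: a for a in database}
    hits = [
        by_variant[v]
        for v in participant_variants
        if v in by_variant and EVIDENCE_RANK[by_variant[v].evidence] <= cutoff
    ]
    # Strongest evidence first, then by identifier for a stable order.
    return sorted(hits, key=lambda a: (EVIDENCE_RANK[a.evidence], a.variant))
```

Lowering the threshold to "uncertain" lengthens the report at the cost of more findings that may not hold up, which is the trade-off between false negatives and false positives discussed above.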

Outlook: the PGP from 10 to 100 000 

After publishing initial data from its first 10 participants 
in 2008, the PGP has continued to broaden the scope of 
the information it is collecting and publishing while 
simultaneously commencing the next stages of partici- 
pant enrollment. From exome to whole-genome 
sequence data, the development and release of the GET- 
EvidenceBase tool 80 for generation of Preliminary 
Research Reports, and the publication of substantial 
scholarship based on the PGP data generated to date, 
the project's progress has been substantial. The PGP is 
now supported by PersonalGenomes.org, a 501(c)(3) 
non-profit charity that coordinates the international 
efforts of the PGP with other collaborative public 
genomics research projects around the world. Both the 
PGP and PersonalGenomes.org continue to strive to 
develop and disseminate genomic technologies, pheno- 
typing strategies, and knowledge on a global scale and 
to produce tangible and widely available improvements 
in the understanding and management of human health 
in a responsible fashion. 
Advances in the personal genome: 
from the Human Genome Project to the 
Personal Genome Project 

The cost of a diploid human genome sequence has fallen 
from about 70 million dollars to 2000 dollars since 2007, 
even as the standards for redundancy have increased from 
7-fold to 40-fold in order to improve genotype call rates. 
Together with the low return on investment for common 
single-nucleotide polymorphisms, this has caused a 
significant rise in interest in correlating genome sequences 
with comprehensive environmental and trait data (GET). 
The cost of electronic health records, imaging, and 
microbiological, immunological, and behavioral data is also 
falling rapidly. Sharing such datasets and their 
interpretations with a diversity of researchers and research 
subjects highlights the need for informed-consent models 
capable of addressing new privacy and other issues, as 
well as for flexible data-sharing resources that make 
materials and data available with minimal restrictions on 
use. This article examines the Personal Genome Project's 
effort to develop a GET database as a public genomics 
resource broadly accessible to both researchers and 
research participants, while respecting the highest 
standards of research ethics. 